Abstractive summarization uses computer algorithms to understand the content of an article and generate a fluent summary of the original text. The summary is not a set of existing paragraphs or sentences extracted from the source document, but a compressed interpretation of its main content, and it may use words that do not appear in the source at all [8]. Sutskever et al. [9] first proposed a neural network-based sequence-to-sequence (seq2seq) model; Vaswani et al. [10] proposed a new, simple network architecture, the Transformer, which is based entirely on the attention mechanism and provides a powerful algorithmic tool for natural language processing; Tan et al. [11] proposed a new graph-based attention mechanism.
Compared with previous neural abstractive models, their model achieves a considerable improvement. Hao et al. [12] proposed a feature-enhanced seq2seq summarization model that uses two feature-capture networks to improve the encoder and decoder of the traditional seq2seq structure, strengthening the model's ability to capture and store long-term and global features and making the generated summaries richer in information and more fluent. Devlin et al. [13] put forward BERT, a pre-trained bidirectional Transformer encoder that makes full use of contextual semantic information and has achieved strong results on many downstream natural language processing tasks. Radford et al. [14] proposed GPT, a unidirectional Transformer decoder model that, like BERT, follows the pre-training and fine-tuning paradigm, thereby improving the scores of related downstream NLP tasks.
Subsequently, many fusion models based on these two pre-trained models have been proposed for the NLP task of automatic summarization. Tan et al. [15] proposed the BERT-PGN model to address the insufficient understanding of sentence context in generative summarization, integrating a pointer-generator network into BERT; the summaries obtained in their experiments improved the ROUGE-2 and ROUGE-4 scores. For NLU tasks such as sentiment classification, named entity recognition, and reading comprehension, the BERT model described above has achieved good results. However, for sequence-to-sequence natural language generation tasks such as machine translation, text summarization, and dialogue generation, it achieves only suboptimal results, because BERT's pre-training strategy trains only the encoder. To solve this problem, Kaitao Song et al. [16] proposed the MASS model, which trains the encoder and decoder jointly so that both learn at the same time in the pre-training stage. It is the first model to unify BERT-style pre-training with a generation model, and its ROUGE scores improve over BERT and other models.
However, the abstracts generated by machine learning or deep learning models may deviate from the theme of the article, focusing too heavily on non-key sentences in the original text. For example, in a long news story that contains question-and-answer style narration by the people involved, the abstractive summarizer may concentrate on summarizing their dialogue and ignore the core points of the whole event. The BART-TextRank model proposed in this paper introduces key sentences extracted from the source documents of the data set, which helps the model absorb the key sentences of a long article, so that the generated abstract better reflects the topic of the article and summarizes its main idea.
III. BART-TEXTRANK MODEL
Based on the original BART model, the proposed BART-TextRank model introduces the TextRank method to extract key sentences from the data set and extends the abstractive summarization of the BART model to multiple rounds. The summary generated in the first round is fused with the extracted key-sentence text, and the new data set is then fed into the BART model again to complete a second round of abstractive summarization. The method consists of the following seven steps. In the first and second steps, the text sequence is input into the TextRank layer and the BART layer at the same time; after the third and fourth steps, we obtain the TextRank summary and the BART summary. In the TextRank layer, we first vectorize the sentences to obtain sentence vectors, then compute a score for each sentence from the similarity matrix and graph model, and take the top-ranked sentences as the summary. In the BART layer, [CLS] marks the beginning of a sentence and [SEP] marks its end, with the word tokens in between. Each word is embedded and converted into a vector, and the predicted words are then produced by a bidirectional encoder and a left-to-right decoder to form the summary sentence. In the fifth step, the two summarization results are combined to form a new data set; at this point, the enhanced data set contains the key sentences of the article. In the sixth step, this data set is fed into the BART layer again, and in the seventh step the final summary is obtained. The network structure of the model is shown in Figure 1 below:
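The control flow of the seven steps can be sketched as a small driver function. This is a minimal, hypothetical sketch: `summarize` and `extract_key_sentences` are placeholder callables standing in for the BART layer (in practice, a pre-trained BART checkpoint) and the TextRank layer; they are not part of the paper's implementation.

```python
def bart_textrank_pipeline(document, summarize, extract_key_sentences):
    """Run the two-round BART-TextRank summarization scheme.

    `summarize` stands in for the BART layer and `extract_key_sentences`
    for the TextRank layer; both are injected so that the seven-step
    control flow can be shown without loading a model."""
    # Steps 1-4: feed the text to both layers and collect their outputs.
    bart_summary = summarize(document)
    key_sentences = extract_key_sentences(document)
    # Step 5: fuse the abstractive summary with the extracted key
    # sentences to form the enhanced data set.
    fused = bart_summary + " " + " ".join(key_sentences)
    # Steps 6-7: a second abstractive round over the fused text
    # yields the final summary.
    return summarize(fused)
```

A real instantiation would pass, for example, a Hugging Face BART summarization call as `summarize` and the TextRank extractor of the next subsection as `extract_key_sentences`.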
A. Key Sentence Extraction
The first step of the BART-TextRank model is to input the text sequence into the TextRank layer to extract topic sentences, i.e., the key sentences. TextRank is an extractive text summarization algorithm whose core idea is graph-based ranking; its basic framework is derived from Google's PageRank algorithm. An article is divided into sentence-level or word-level units that serve as the nodes of a graph model, and the similarity between units defines the edges connecting those nodes. Finally, after the weights are accumulated by iterative ranking, the top n sentences are extracted as the key sentences to form the final summary.
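A minimal self-contained sketch of this procedure follows. It uses bag-of-words cosine similarity for the edge weights and damped power iteration (as in PageRank) for the ranking; a production implementation might use TF-IDF weights or sentence embeddings for the vectorization step instead.

```python
import math
import re

def textrank_key_sentences(text, top_n=2, d=0.85, iters=50):
    """Score sentences with a PageRank-style graph over cosine similarity
    of bag-of-words vectors; return the top_n sentences in original order."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if len(sentences) <= top_n:
        return sentences

    # Vectorize: bag-of-words counts per sentence.
    bows = []
    for s in sentences:
        bow = {}
        for w in re.findall(r'\w+', s.lower()):
            bow[w] = bow.get(w, 0) + 1
        bows.append(bow)

    def cosine(a, b):
        dot = sum(v * b.get(w, 0) for w, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Similarity matrix = weighted edges of the sentence graph.
    n = len(sentences)
    sim = [[cosine(bows[i], bows[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]

    # Damped power iteration accumulates each node's weight.
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - d) / n + d * sum(
                scores[j] * sim[j][i] / out_weight[j]
                for j in range(n) if sim[j][i] > 0 and out_weight[j] > 0
            )
            for i in range(n)
        ]

    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(top)]
```

The fused data set of step five is then obtained by concatenating these extracted sentences with the first-round BART summary.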